 n-gram statistics


Understanding Transformers via N-Gram Statistics

Neural Information Processing Systems

Transformer-based large language models (LLMs) display extreme proficiency with language, yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and insights into how well transformers can be approximated by N-gram rulesets in the limit where these rulesets become increasingly complex. In this latter direction, we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets.


Efficient Training of Language Models with Compact and Consistent Next Token Distributions

arXiv.org Artificial Intelligence

Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; however, the construction and querying of such $n$-grams, if done naively, prove to be costly and significantly impede training speed, thereby limiting their application in modern large language model pre-training. We introduce an alternative compact representation of the next token distribution that, in expectation, aligns with the complete $n$-gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the $n$-gram regularized model and our approximation yield substantial improvements in model quality and convergence rate compared to existing methods. Furthermore, our approximation facilitates scalability of gains to larger datasets and models compared to the straightforward $n$-gram regularization method.
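As a rough illustration of what a corpus-level n-gram next-token distribution looks like as a training target, the sketch below counts continuations per context and scores a predicted distribution against the resulting soft target. It is a naive construction under stated assumptions, not the compact representation the abstract introduces; the function names and the `eps` smoothing constant are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_next_token_distribution(tokens, n):
    """Empirical P(next token | previous n-1 tokens), estimated from raw counts."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return {
        context: {tok: c / sum(nexts.values()) for tok, c in nexts.items()}
        for context, nexts in counts.items()
    }


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target.

    Both arguments map token -> probability. eps (an assumption of this
    sketch) guards against log(0) for tokens the prediction omits.
    """
    return -sum(p * math.log(predicted.get(tok, 0.0) + eps)
                for tok, p in target.items())
```

With a one-hot target this reduces to the usual next-token loss; with the n-gram target, every observed continuation of a context contributes in proportion to its corpus frequency.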


Understanding Transformers via N-gram Statistics

arXiv.org Artificial Intelligence

Transformer-based large language models (LLMs) display extreme proficiency with language, yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and insights into how well transformers can be approximated by N-gram rulesets in the limit where these rulesets become increasingly complex. In this latter direction, we find that for 78% of LLM next-token distributions on TinyStories, their top-1 predictions agree with those provided by our N-gram rulesets.
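The top-1 agreement figure quoted above can be pictured with a much simpler stand-in than the paper's rulesets: a single fixed-order N-gram rule that predicts the most frequent continuation of the last N-1 tokens. The sketch below is an illustration under that assumption only; the function names are hypothetical, and the model's top-1 predictions are taken as a precomputed list.

```python
from collections import Counter, defaultdict


def build_ngram_rule(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def rule_top1(counts, context):
    """Most frequent next token for the context, or None if the context is unseen."""
    nexts = counts.get(tuple(context))
    if not nexts:
        return None
    return nexts.most_common(1)[0][0]


def top1_agreement(counts, n, contexts, model_top1):
    """Fraction of positions where the rule's top-1 equals the model's top-1."""
    hits = 0
    for context, predicted in zip(contexts, model_top1):
        # The rule only looks at the last n-1 tokens of the full context.
        suffix = context[-(n - 1):] if n > 1 else []
        if rule_top1(counts, suffix) == predicted:
            hits += 1
    return hits / len(contexts)
```

Unseen contexts count as disagreements here, which is a design choice of this sketch; a fuller treatment would fall back to shorter contexts or combine several rules.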


CMU's ASR2K Pipeline Recognizes Speech in 1909 Languages Without Audio

#artificialintelligence

AI-powered speech recognition systems have made great progress in recent years, with speech-to-text processing now so powerful that the occasional errors are little more than curious exceptions. Most contemporary models addressing this task, however, require massive labelled training data, which is simple enough to source for English, Chinese, and other popular languages but challenging to obtain for the low-resource tongues that make up the majority of the world's 8,000 languages. To address this issue, a Carnegie Mellon University research team has developed a speech recognition pipeline that can recognize 1909 languages without any audio for the target language. Their ASR2K pipeline achieves 45 percent CER and 69 percent WER when using 10,000 raw text utterances on the CMU Wilderness dataset, and is introduced in the paper ASR2K: Speech Recognition for Around 2000 Languages Without Audio. The proposed pipeline comprises separate acoustic, pronunciation, and language models.


ASR2K: Speech Recognition for Around 2000 Languages without Audio

arXiv.org Artificial Intelligence

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and CMU Wilderness. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10,000 raw text utterances.
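For readers unfamiliar with the two metrics reported above, character error rate (CER) and word error rate (WER) are both edit distances normalized by the reference length, so lower is better. The sketch below uses the standard Levenshtein definition; the function names are hypothetical, and counting spaces as characters in CER is an assumption of this example rather than something the abstract specifies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, transcribing "the cat sat" as "the bat sat" substitutes one of three words and one of eleven characters, so the WER is 1/3 while the CER is 1/11.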


HyperSeed: Unsupervised Learning with Vector Symbolic Architectures

arXiv.org Artificial Intelligence

... machine learning and robotics context is currently gaining a great momentum [1]-[6]. In classification tasks, the use of VSA leads to order of magnitude increase in energy efficiency of computations on the one hand and natively enables one-shot and multitask learning on the other [7]. It is prospected that VSA will play a key role in the development of novel neuromorphic computer architectures [8] as an algorithmic abstraction [9], [10]. The main contribution of this paper is a novel algorithm for unsupervised learning called Hyperseed, ... Across all experiments, Hyperseed convincingly demonstrates its key novelties of learning from a few input vectors and single vector operation learning rule, both of which contribute towards reduced time and computation complexity. The paper is structured as follows. Section II describes the related work relevant to Hyperseed operations. The used methods including the fundamentals of VSA are presented in Section III. Section IV presents the main contribution - the method for unsupervised learning Hyperseed. Section V reports the results of the performance evaluation the experiments.


Detecting and Exorcising Statistical Demons from Language Models with Anti-Models of Negative Data

arXiv.org Artificial Intelligence

It's been said that "Language Models are Unsupervised Multitask Learners." Indeed, self-supervised language models trained on "positive" examples of English text generalize in desirable ways to many natural language tasks. But if such models can stray so far from an initial self-supervision objective, a wayward model might generalize in undesirable ways too, say to nonsensical "negative" examples of unnatural language. A key question in this work is: do language models trained on (positive) training data also generalize to (negative) test data? We use this question as a contrivance to assess the extent to which language models learn undesirable properties of text, such as n-grams, that might interfere with the learning of more desirable properties of text, such as syntax. We find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data, indicating standard self-supervision generalizes too far. We propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data. We apply the method to remove n-gram signals from LSTMs and find that doing so causes them to favor syntactic signals, as demonstrated by large error reductions (up to 46% on the hardest cases) on a syntactic subject-verb agreement task.